F1 Score (f1_score)#

The F1 score (a.k.a. F-measure) summarizes performance on the positive class by combining:

  • Precision: when we predict positive, how often are we correct?

  • Recall: of all actual positives, how many did we find?

It’s especially common when:

  • the positive class is rare (class imbalance)

  • false positives and false negatives both matter (roughly equally)

Goals#

  • Derive the F1 formula from the confusion matrix.

  • Build a from-scratch NumPy implementation (binary + multiclass averages).

  • Visualize how thresholding changes precision, recall, and F1.

  • Use F1 to tune a simple logistic regression classifier.

Quick import#

from sklearn.metrics import f1_score
import numpy as np

import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score as sk_f1_score
from sklearn.metrics import precision_score as sk_precision_score
from sklearn.metrics import recall_score as sk_recall_score

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(7)
import sklearn
import plotly

print('numpy :', np.__version__)
print('sklearn:', sklearn.__version__)
print('plotly:', plotly.__version__)
numpy : 1.26.2
sklearn: 1.6.0
plotly: 6.5.2

1) Confusion matrix → precision, recall#

Assume binary classification:

  • true label: \(y \in \{0, 1\}\) (1 = positive)

  • predicted label: \(\hat{y} \in \{0, 1\}\)

The confusion matrix counts:

|         | \(\hat{y}=1\) | \(\hat{y}=0\) |
|---------|---------------|---------------|
| \(y=1\) | TP            | FN            |
| \(y=0\) | FP            | TN            |

From these:

\[ \text{precision} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}},\qquad \text{recall} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} \]
  • Precision asks: how noisy are our positive predictions?

  • Recall asks: how many positives did we miss?

# A tiny example
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 1, 1, 0, 0, 0, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))
fp = np.sum((y_true == 0) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))
tn = np.sum((y_true == 0) & (y_pred == 0))

tp, fp, fn, tn
(2, 1, 1, 4)

2) The F1 score#

The F1 score is the harmonic mean of precision and recall:

\[ F_1 = \frac{2}{\frac{1}{\text{precision}} + \frac{1}{\text{recall}}} = \frac{2\,\text{precision}\cdot\text{recall}}{\text{precision}+\text{recall}} \]

Substituting the confusion-matrix definitions gives a very useful form:

\[ F_1 = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}} \]

Key intuition:

  • Harmonic mean punishes imbalance: if precision is high but recall is near zero (or vice versa), \(F_1\) is near zero.

  • True negatives do not appear in the formula. That’s great when negatives are abundant (imbalance), but it can also hide poor performance on the negative class.
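A quick numeric check, using the counts from the tiny example above (TP=2, FP=1, FN=1), confirms that the harmonic-mean form and the count form agree:

```python
import numpy as np

tp, fp, fn = 2, 1, 1  # counts from the tiny example above

precision = tp / (tp + fp)  # 2/3
recall = tp / (tp + fn)     # 2/3
f1_hm = 2 * precision * recall / (precision + recall)
f1_counts = 2 * tp / (2 * tp + fp + fn)

print(f1_hm, f1_counts)  # both 2/3 ≈ 0.6667
assert np.isclose(f1_hm, f1_counts)
```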

A generalization is the \(F_\beta\) score:

\[ F_\beta = (1+\beta^2)\,\frac{\text{precision}\,\text{recall}}{\beta^2\,\text{precision}+\text{recall}} \]
  • \(\beta>1\) emphasizes recall

  • \(\beta<1\) emphasizes precision
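As a minimal sketch (the helper name fbeta_from_counts is ours, not sklearn's), substituting the confusion-matrix definitions into \(F_\beta\) gives a count form analogous to the F1 one:

```python
import numpy as np

def fbeta_from_counts(tp, fp, fn, beta=1.0, zero_division=0.0):
    # F_beta = (1 + b^2) TP / ((1 + b^2) TP + b^2 FN + FP)
    b2 = beta ** 2
    den = (1 + b2) * tp + b2 * fn + fp
    return float(zero_division) if den == 0 else (1 + b2) * tp / den

# Precision-heavy counts: precision = 1.0, recall = 2/3.
print(fbeta_from_counts(2, 0, 1, beta=1.0))  # F1 = 0.8
print(fbeta_from_counts(2, 0, 1, beta=2.0))  # recall-leaning -> lower here
print(fbeta_from_counts(2, 0, 1, beta=0.5))  # precision-leaning -> higher here
```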

# Harmonic mean vs arithmetic mean
ps = np.linspace(0.001, 0.999, 400)
r_fixed = 0.2

f1 = 2 * ps * r_fixed / (ps + r_fixed)
am = 0.5 * (ps + r_fixed)

fig = go.Figure()
fig.add_trace(go.Scatter(x=ps, y=f1, mode='lines', name='F1 (harmonic mean)'))
fig.add_trace(go.Scatter(x=ps, y=am, mode='lines', name='Arithmetic mean', line=dict(dash='dash')))
fig.update_layout(
    title='Same recall, changing precision: harmonic vs arithmetic mean',
    xaxis_title='Precision',
    yaxis_title='Score',
    legend=dict(orientation='h', yanchor='bottom', y=1.02, xanchor='left', x=0),
)
fig.show()
# F1 as a function of precision and recall
precision_grid = np.linspace(0, 1, 201)
recall_grid = np.linspace(0, 1, 201)
P, R = np.meshgrid(precision_grid, recall_grid)

den = P + R

F1 = np.zeros_like(den, dtype=float)
np.divide(2 * P * R, den, out=F1, where=den != 0)

fig = px.imshow(
    F1,
    x=precision_grid,
    y=recall_grid,
    origin='lower',
    aspect='auto',
    labels={'x': 'Precision', 'y': 'Recall', 'color': 'F1'},
    title='F1 surface (heatmap) over precision/recall',
)
fig.update_layout(coloraxis_colorbar=dict(tickformat='.2f'))
fig.show()

3) NumPy implementation (from scratch)#

Below is a minimal implementation that mirrors the behavior of sklearn.metrics.f1_score:

  • binary F1 via confusion-matrix counts

  • safe handling of zero division (when there are no predicted positives, or no true positives)

  • multiclass averages: macro, micro, weighted

Convention:

  • when a denominator is zero, we return zero_division (default 0.0)

def _as_1d(a):
    a = np.asarray(a)
    return a.ravel()


def _safe_divide(num, den, zero_division=0.0):
    num = np.asarray(num, dtype=float)
    den = np.asarray(den, dtype=float)

    out = np.full(np.broadcast(num, den).shape, float(zero_division), dtype=float)
    np.divide(num, den, out=out, where=den != 0)
    return out


def confusion_counts_binary(y_true, y_pred, *, pos_label=1):
    y_true = _as_1d(y_true)
    y_pred = _as_1d(y_pred)
    if y_true.shape != y_pred.shape:
        raise ValueError(f"shape mismatch: y_true{y_true.shape} vs y_pred{y_pred.shape}")

    yt = y_true == pos_label
    yp = y_pred == pos_label

    tp = np.sum(yt & yp)
    fp = np.sum(~yt & yp)
    fn = np.sum(yt & ~yp)
    tn = np.sum(~yt & ~yp)

    return tp, fp, fn, tn


def precision_recall_f1_from_counts(tp, fp, fn, *, zero_division=0.0):
    precision = _safe_divide(tp, tp + fp, zero_division=zero_division)
    recall = _safe_divide(tp, tp + fn, zero_division=zero_division)
    f1 = _safe_divide(2 * tp, 2 * tp + fp + fn, zero_division=zero_division)
    return precision, recall, f1


def f1_score_binary(y_true, y_pred, *, pos_label=1, zero_division=0.0):
    tp, fp, fn, tn = confusion_counts_binary(y_true, y_pred, pos_label=pos_label)
    _, _, f1 = precision_recall_f1_from_counts(tp, fp, fn, zero_division=zero_division)
    return float(f1)


def f1_score_multiclass(y_true, y_pred, *, labels=None, average='macro', zero_division=0.0):
    '''Multiclass/single-label F1 via one-vs-rest counts.

    average: {'macro','micro','weighted', None}
    '''

    y_true = _as_1d(y_true)
    y_pred = _as_1d(y_pred)
    if y_true.shape != y_pred.shape:
        raise ValueError(f"shape mismatch: y_true{y_true.shape} vs y_pred{y_pred.shape}")

    if labels is None:
        labels = np.unique(np.concatenate([y_true, y_pred]))
    labels = np.asarray(labels)

    tps = []
    fps = []
    fns = []
    supports = []

    for lab in labels:
        tp = np.sum((y_true == lab) & (y_pred == lab))
        fp = np.sum((y_true != lab) & (y_pred == lab))
        fn = np.sum((y_true == lab) & (y_pred != lab))

        tps.append(tp)
        fps.append(fp)
        fns.append(fn)
        supports.append(np.sum(y_true == lab))

    tps = np.asarray(tps)
    fps = np.asarray(fps)
    fns = np.asarray(fns)
    supports = np.asarray(supports)

    per_class_f1 = _safe_divide(2 * tps, 2 * tps + fps + fns, zero_division=zero_division)

    if average is None:
        return labels, per_class_f1

    average = str(average).lower()
    if average == 'macro':
        return float(np.mean(per_class_f1))
    if average == 'weighted':
        w = _safe_divide(supports, supports.sum(), zero_division=0.0)
        return float(np.sum(w * per_class_f1))
    if average == 'micro':
        tp = tps.sum()
        fp = fps.sum()
        fn = fns.sum()
        return float(_safe_divide(2 * tp, 2 * tp + fp + fn, zero_division=zero_division))

    raise ValueError("average must be one of: 'macro', 'micro', 'weighted', None")
# Quick sanity checks vs sklearn
y_true = rng.integers(0, 2, size=200)
y_pred = rng.integers(0, 2, size=200)

ours = f1_score_binary(y_true, y_pred)
sk = sk_f1_score(y_true, y_pred, zero_division=0)
print('binary  f1: ours=', ours, 'sklearn=', sk)

y_true_mc = rng.integers(0, 3, size=300)
y_pred_mc = rng.integers(0, 3, size=300)

for avg in ['macro', 'micro', 'weighted']:
    ours = f1_score_multiclass(y_true_mc, y_pred_mc, average=avg)
    sk = sk_f1_score(y_true_mc, y_pred_mc, average=avg, zero_division=0)
    print(f"multiclass {avg:8s}: ours={ours:.6f} sklearn={sk:.6f}")
binary  f1: ours= 0.5539906103286385 sklearn= 0.5539906103286385
multiclass macro   : ours=0.335908 sklearn=0.335908
multiclass micro   : ours=0.336667 sklearn=0.336667
multiclass weighted: ours=0.336466 sklearn=0.336466

4) Thresholding: why F1 depends on the decision rule#

Many classifiers output a score or a probability \(\hat{p}(y=1\mid x)\).

To produce hard labels we pick a threshold \(t\):

\[ \hat{y}(t) = \mathbb{1}[\hat{p} \ge t] \]

Changing \(t\) changes FP/FN, therefore precision/recall, therefore F1.

A common way to use F1 for optimization is to choose \(t\) (and other hyperparameters) to maximize validation-set F1:

\[ t^* \in \arg\max_{t\in[0,1]} F_1\bigl(y,\ \mathbb{1}[\hat{p}\ge t]\bigr) \]

This is practical because:

  • F1 is not differentiable in the model parameters (it jumps when a single point crosses the threshold)

  • but it’s easy to optimize over a 1D threshold via a grid search

# Synthetic imbalanced dataset (2D for visualization)
X, y = make_classification(
    n_samples=2500,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    weights=[0.9, 0.1],
    class_sep=1.4,
    random_state=7,
)

X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=7
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=7
)

# Standardize using train statistics (low-level)
mean_ = X_train.mean(axis=0)
std_ = X_train.std(axis=0)
std_ = np.where(std_ == 0, 1.0, std_)

X_train_s = (X_train - mean_) / std_
X_val_s = (X_val - mean_) / std_
X_test_s = (X_test - mean_) / std_

fig = px.scatter(
    x=X_train_s[:, 0],
    y=X_train_s[:, 1],
    color=y_train.astype(str),
    opacity=0.7,
    labels={'x': 'x1 (standardized)', 'y': 'x2 (standardized)', 'color': 'class'},
    title='Training data (imbalanced)',
)
fig.show()

print('class balance (train):', np.bincount(y_train) / y_train.size)
class balance (train): [0.8973 0.1027]
def add_intercept(X: np.ndarray) -> np.ndarray:
    X = np.asarray(X, dtype=float)
    return np.c_[np.ones((X.shape[0], 1)), X]


def sigmoid(z):
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out


def log_loss_from_proba(y_true, p, eps=1e-15):
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))


def fit_logistic_regression_gd(
    X,
    y,
    *,
    lr=0.2,
    max_iter=2000,
    alpha=0.0,
    tol=1e-8,
):
    '''Binary logistic regression with gradient descent + optional L2 penalty.'''

    Xb = add_intercept(X)
    y = np.asarray(y, dtype=float).ravel()

    n, d = Xb.shape
    w = np.zeros(d)
    history = []

    for _ in range(max_iter):
        p = sigmoid(Xb @ w)
        loss = log_loss_from_proba(y, p) + 0.5 * alpha * np.sum(w[1:] ** 2)
        history.append(loss)

        grad = (Xb.T @ (p - y)) / n
        grad[1:] += alpha * w[1:]

        w_new = w - lr * grad

        if np.linalg.norm(w_new - w) < tol:
            w = w_new
            break
        w = w_new

    return w, np.asarray(history)


def predict_proba_logreg(X, w):
    Xb = add_intercept(X)
    return sigmoid(Xb @ w)
w, loss_hist = fit_logistic_regression_gd(X_train_s, y_train, lr=0.2, max_iter=3000, alpha=0.05)

fig = go.Figure()
fig.add_trace(go.Scatter(y=loss_hist, mode='lines', name='train log-loss'))
fig.update_layout(title='Training curve (log-loss)', xaxis_title='Iteration', yaxis_title='Log-loss')
fig.show()

w
array([-3.3621, -1.0781,  0.8771])
def precision_recall_f1_at_thresholds(y_true, y_score, thresholds, *, zero_division=0.0):
    y_true = np.asarray(y_true).astype(int).ravel()
    y_score = np.asarray(y_score, dtype=float).ravel()
    thresholds = np.asarray(thresholds, dtype=float)

    y_true_pos = y_true == 1
    pred_pos = y_score[:, None] >= thresholds[None, :]

    tp = np.sum(pred_pos & y_true_pos[:, None], axis=0)
    fp = np.sum(pred_pos & ~y_true_pos[:, None], axis=0)
    fn = np.sum(~pred_pos & y_true_pos[:, None], axis=0)

    precision = _safe_divide(tp, tp + fp, zero_division=zero_division)
    recall = _safe_divide(tp, tp + fn, zero_division=zero_division)
    f1 = _safe_divide(2 * tp, 2 * tp + fp + fn, zero_division=zero_division)

    return precision, recall, f1, tp, fp, fn


p_val = predict_proba_logreg(X_val_s, w)
thresholds = np.linspace(0.0, 1.0, 401)

prec_t, rec_t, f1_t, tp_t, fp_t, fn_t = precision_recall_f1_at_thresholds(
    y_val, p_val, thresholds, zero_division=0.0
)

best_idx = int(np.argmax(f1_t))
t_best = float(thresholds[best_idx])

print('best threshold (val):', t_best)
print('F1 at best threshold (val):', float(f1_t[best_idx]))
best threshold (val): 0.34500000000000003
F1 at best threshold (val): 0.9702970297029703
fig = go.Figure()
fig.add_trace(go.Scatter(x=thresholds, y=prec_t, mode='lines', name='precision'))
fig.add_trace(go.Scatter(x=thresholds, y=rec_t, mode='lines', name='recall'))
fig.add_trace(go.Scatter(x=thresholds, y=f1_t, mode='lines', name='F1', line=dict(width=3)))

fig.add_vline(x=t_best, line_width=2, line_dash='dash', line_color='black')
fig.update_layout(
    title='Precision / Recall / F1 vs threshold (validation set)',
    xaxis_title='Threshold t',
    yaxis_title='Score',
    legend=dict(orientation='h', yanchor='bottom', y=1.02, xanchor='left', x=0),
)
fig.show()
def confusion_matrix_from_threshold(y_true, y_score, t):
    y_pred = (np.asarray(y_score) >= t).astype(int)
    tp, fp, fn, tn = confusion_counts_binary(y_true, y_pred, pos_label=1)
    mat = np.array([[tn, fp], [fn, tp]])
    return mat, (tp, fp, fn, tn)


mat_05, counts_05 = confusion_matrix_from_threshold(y_val, p_val, 0.5)
mat_best, counts_best = confusion_matrix_from_threshold(y_val, p_val, t_best)

fig = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=(
        f't=0.50 (F1={f1_score_binary(y_val, (p_val>=0.5).astype(int)):.3f})',
        f't={t_best:.2f} (F1={f1_score_binary(y_val, (p_val>=t_best).astype(int)):.3f})',
    ),
)

for col, mat in enumerate([mat_05, mat_best], start=1):
    fig.add_trace(
        go.Heatmap(
            z=mat,
            x=['Pred 0', 'Pred 1'],
            y=['True 0', 'True 1'],
            text=mat,
            texttemplate='%{text}',
            colorscale='Blues',
            showscale=False,
        ),
        row=1,
        col=col,
    )

fig.update_layout(title='Confusion matrices on validation set')
fig.show()

counts_05, counts_best
((44, 0, 7, 449), (49, 1, 2, 448))
# Precision-Recall curve with iso-F1 lines
# (each point corresponds to one threshold)

fig = go.Figure()
fig.add_trace(go.Scatter(x=rec_t, y=prec_t, mode='lines', name='PR curve'))
fig.add_trace(
    go.Scatter(
        x=[rec_t[best_idx]],
        y=[prec_t[best_idx]],
        mode='markers',
        marker=dict(size=10, color='red'),
        name=f'Best F1 (t={t_best:.2f})',
    )
)

f_levels = [0.2, 0.4, 0.6, 0.8]
p_line = np.linspace(0.001, 1.0, 400)
for f in f_levels:
    mask = p_line > (f / 2)
    p = p_line[mask]
    r = (f * p) / (2 * p - f)
    r = np.clip(r, 0, 1)

    fig.add_trace(
        go.Scatter(
            x=r,
            y=p,
            mode='lines',
            line=dict(dash='dot', width=1),
            name=f'F1={f}',
            hoverinfo='skip',
        )
    )

fig.update_layout(
    title='Precision–Recall curve (validation) with iso-F1 lines',
    xaxis_title='Recall',
    yaxis_title='Precision',
    xaxis=dict(range=[0, 1]),
    yaxis=dict(range=[0, 1]),
)
fig.show()
# How the threshold changes the *linear* decision boundary
# p = sigmoid(z) >= t  <=>  z >= log(t/(1-t))

def boundary_line(w, t, x1):
    z_thr = np.log(t / (1 - t))
    if np.isclose(w[2], 0.0):
        return None
    x2 = (z_thr - w[0] - w[1] * x1) / w[2]
    return x2


x1 = np.linspace(X_train_s[:, 0].min() - 0.5, X_train_s[:, 0].max() + 0.5, 200)
x2_05 = boundary_line(w, 0.5, x1)
x2_best = boundary_line(w, t_best, x1)

fig = px.scatter(
    x=X_train_s[:, 0],
    y=X_train_s[:, 1],
    color=y_train.astype(str),
    opacity=0.6,
    labels={'x': 'x1 (standardized)', 'y': 'x2 (standardized)', 'color': 'class'},
    title='Logistic regression: threshold shifts the decision boundary',
)

if x2_05 is not None:
    fig.add_trace(go.Scatter(x=x1, y=x2_05, mode='lines', name='t=0.50', line=dict(color='black')))
if x2_best is not None:
    fig.add_trace(go.Scatter(x=x1, y=x2_best, mode='lines', name=f't={t_best:.2f}', line=dict(color='red')))

fig.show()

Evaluate on the test set#

We picked \(t^*\) on the validation set, keeping the test set untouched for the final comparison.

Now compare:

  • default \(t=0.5\)

  • tuned \(t=t^*\)

p_test = predict_proba_logreg(X_test_s, w)

def report_binary(y_true, p, t):
    y_hat = (p >= t).astype(int)
    tp, fp, fn, tn = confusion_counts_binary(y_true, y_hat)
    prec, rec, f1 = precision_recall_f1_from_counts(tp, fp, fn)
    return {
        'threshold': float(t),
        'precision': float(prec),
        'recall': float(rec),
        'f1': float(f1),
        'tp': int(tp),
        'fp': int(fp),
        'fn': int(fn),
        'tn': int(tn),
    }

rep_05 = report_binary(y_test, p_test, 0.5)
rep_best = report_binary(y_test, p_test, t_best)

rep_05, rep_best
({'threshold': 0.5,
  'precision': 1.0,
  'recall': 0.7884615384615384,
  'f1': 0.8817204301075269,
  'tp': 41,
  'fp': 0,
  'fn': 11,
  'tn': 448},
 {'threshold': 0.34500000000000003,
  'precision': 0.9791666666666666,
  'recall': 0.9038461538461539,
  'f1': 0.94,
  'tp': 47,
  'fp': 1,
  'fn': 5,
  'tn': 447})

5) Using F1 for model selection (simple “optimization” loop)#

F1 is typically used as a selection criterion rather than a differentiable training loss.

Example: tune L2 strength \(\alpha\) for logistic regression by:

  1. fit the model for each \(\alpha\)

  2. pick the threshold \(t\) that maximizes validation F1

  3. choose the best \((\alpha, t)\) pair

alphas = [0.0, 0.01, 0.05, 0.2, 1.0]
thresholds = np.linspace(0.0, 1.0, 401)

results = []
for a in alphas:
    w_a, _ = fit_logistic_regression_gd(X_train_s, y_train, lr=0.2, max_iter=3000, alpha=a)
    p_val_a = predict_proba_logreg(X_val_s, w_a)

    _, _, f1_a, _, _, _ = precision_recall_f1_at_thresholds(y_val, p_val_a, thresholds)
    best_idx_a = int(np.argmax(f1_a))

    results.append(
        {
            'alpha': float(a),
            't_best': float(thresholds[best_idx_a]),
            'f1_val_best': float(f1_a[best_idx_a]),
        }
    )

results
[{'alpha': 0.0, 't_best': 0.4875, 'f1_val_best': 0.9702970297029703},
 {'alpha': 0.01, 't_best': 0.4175, 'f1_val_best': 0.9702970297029703},
 {'alpha': 0.05,
  't_best': 0.34500000000000003,
  'f1_val_best': 0.9702970297029703},
 {'alpha': 0.2, 't_best': 0.2575, 'f1_val_best': 0.9702970297029703},
 {'alpha': 1.0, 't_best': 0.155, 'f1_val_best': 0.9702970297029703}]
alpha_vals = np.array([r['alpha'] for r in results])
f1_vals = np.array([r['f1_val_best'] for r in results])

best = results[int(np.argmax(f1_vals))]

fig = go.Figure()
fig.add_trace(go.Scatter(x=alpha_vals, y=f1_vals, mode='lines+markers', name='best val F1'))
fig.update_layout(
    title='Validation F1 after threshold tuning vs L2 strength',
    xaxis_title='alpha (L2 strength)',
    yaxis_title='best validation F1',
)
fig.show()

best
{'alpha': 0.0, 't_best': 0.4875, 'f1_val_best': 0.9702970297029703}

6) Multiclass F1: macro vs micro vs weighted#

For multiclass single-label classification, F1 is usually computed by turning each class into a one-vs-rest problem.

  • macro: average F1 across classes (treat each class equally)

  • weighted: average F1 across classes weighted by class support

  • micro: compute global TP/FP/FN across classes before computing F1

Note: in single-label multiclass classification, micro F1 equals accuracy.

y_true_mc = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred_mc = np.array([0, 2, 0, 1, 0, 2, 2, 1, 2])

labels, per_class = f1_score_multiclass(y_true_mc, y_pred_mc, average=None)
print('labels:', labels)
print('per-class F1:', per_class)

for avg in ['macro', 'micro', 'weighted']:
    ours = f1_score_multiclass(y_true_mc, y_pred_mc, average=avg)
    sk = sk_f1_score(y_true_mc, y_pred_mc, average=avg, zero_division=0)
    print(f"{avg:8s}: ours={ours:.6f} sklearn={sk:.6f}")
labels: [0 1 2]
per-class F1: [0.6667 0.5    0.75  ]
macro   : ours=0.638889 sklearn=0.638889
micro   : ours=0.666667 sklearn=0.666667
weighted: ours=0.666667 sklearn=0.666667
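The micro = accuracy identity from the note above is easy to verify directly (a self-contained check, using sklearn for the micro average):

```python
import numpy as np
from sklearn.metrics import f1_score

y_true_mc = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred_mc = np.array([0, 2, 0, 1, 0, 2, 2, 1, 2])

# Every misclassification counts once as a FP (for the predicted class)
# and once as a FN (for the true class), so micro F1 collapses to accuracy.
accuracy = np.mean(y_true_mc == y_pred_mc)
micro_f1 = f1_score(y_true_mc, y_pred_mc, average='micro')
print(accuracy, micro_f1)  # both 6/9 ≈ 0.6667
assert np.isclose(accuracy, micro_f1)
```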

Pros / cons and when to use F1#

Pros

  • Good default when the positive class is rare and you care about both FP and FN.

  • Single number that summarizes the precision–recall tradeoff.

  • Common in information retrieval, detection tasks, and many imbalanced classification settings.

Cons / limitations

  • Ignores true negatives: can be misleading if performance on the negative class matters.

  • Threshold-dependent: you must pick a threshold (or compare across thresholds).

  • Not a proper scoring rule (unlike log-loss / Brier), so it’s not ideal for probability calibration.

  • Not differentiable in model parameters → usually not used as a direct training loss.

  • Can hide tradeoffs: the same F1 can come from very different (precision, recall) pairs.

Good use cases

  • Highly imbalanced binary classification where the negative class is huge (fraud, churn, defect detection).

  • Search / ranking systems after choosing an operating point.

  • Segmentation / detection tasks (F1 is closely related to the Dice coefficient).
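On the last point, a tiny check (the toy masks are ours) that binary F1 coincides with the Dice coefficient \(2|A \cap B| / (|A| + |B|)\), where \(A\) is the predicted-positive set and \(B\) the true-positive set:

```python
import numpy as np

a = np.array([1, 1, 0, 1, 0, 0])  # predicted mask (toy example)
b = np.array([1, 0, 0, 1, 1, 0])  # ground-truth mask

dice = 2 * np.sum(a & b) / (np.sum(a) + np.sum(b))

tp = np.sum((a == 1) & (b == 1))
fp = np.sum((a == 1) & (b == 0))
fn = np.sum((a == 0) & (b == 1))
f1 = 2 * tp / (2 * tp + fp + fn)

print(dice, f1)  # identical: |A| + |B| = (TP + FP) + (TP + FN)
assert np.isclose(dice, f1)
```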

Common pitfalls + diagnostics#

  • Undefined divisions: if the model predicts no positives, precision is undefined. Decide a policy (zero_division=0 is common).

  • Wrong averaging in multiclass: macro emphasizes minority classes; weighted tracks overall distribution.

  • Class imbalance doesn’t magically disappear: F1 helps compared to accuracy, but you still need proper validation and often threshold tuning.

  • If you need to compare models as rankers, prefer PR curves / average precision instead of a single F1 at one threshold.

  • If FP and FN have different costs, prefer \(F_\beta\) or an explicit cost-sensitive metric.
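For the ranker comparison mentioned above, scikit-learn's average_precision_score gives a threshold-free summary of the PR curve; a minimal sketch on synthetic scores (the toy data is ours):

```python
import numpy as np
from sklearn.metrics import average_precision_score, f1_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
# Noisy scores that are mildly informative about the label.
scores = np.clip(0.4 + 0.3 * y + rng.normal(0, 0.25, size=200), 0, 1)

ap = average_precision_score(y, scores)                # whole PR curve
f1_at_half = f1_score(y, (scores >= 0.5).astype(int))  # one operating point
print('average precision:', ap)
print('F1 at t=0.5      :', f1_at_half)
```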

Exercises#

  1. Implement \(F_\beta\) in NumPy and verify it against sklearn.metrics.fbeta_score.

  2. For the logistic regression example, compare the threshold that maximizes F1 vs the threshold that maximizes accuracy.

  3. Create an extremely imbalanced dataset (e.g. 99.5% negatives) and compare accuracy vs F1.

  4. For multiclass, create a dataset with one rare class and compare macro vs weighted F1.

References#

  • scikit-learn API: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html

  • scikit-learn user guide (precision/recall/F-score): https://scikit-learn.org/stable/modules/model_evaluation.html#precision-recall-f-measure-metrics